Abstract

Mutual information is widely used in artificial intelli- 
gence, in a descriptive way, to measure the stochastic de- 
pendence of discrete random variables. In order to address 
questions such as the reliability of the empirical value, one 
must consider sample-to-population inferential approaches. 
This paper deals with the distribution of mutual informa- 
tion, as obtained in a Bayesian framework by a second-order 
Dirichlet prior distribution. The exact analytical expres- 
sion for the mean and an analytical approximation of the 
variance are reported. Asymptotic approximations of the 
distribution are proposed. The results are applied to the 
problem of selecting features for incremental learning and 
classification of the naive Bayes classifier. A fast, newly 
defined method is shown to outperform the traditional ap- 
proach based on empirical mutual information on a num- 
ber of real data sets. Finally, a theoretical development 
is reported that allows one to efficiently extend the above 
methods to incomplete samples in an easy and effective way. 
Keywords: Robust feature selection, naive Bayes classifier, Mutual
Information, Cross Entropy, Dirichlet distribution, Second
order distribution, expectation and variance of mutual information.



1 INTRODUCTION 

The mutual information I (also called cross entropy or information gain)
is a widely used information-theoretic measure for the stochastic
dependency of discrete random variables [Kul68, CT91, Soo00]. It is used,
for instance, in learning Bayesian nets [CL68, Pea88, Bun96, Hec98], where
stochastically dependent nodes shall be connected; it is used to induce
classification trees [Qui93]. It is also used to select features for
classification problems [DHS01], i.e. to select a subset of variables by
which to predict the class variable. This is done in the context of a
filter approach that discards irrelevant features on the basis of low
values of mutual information with the class [Lew92, BL97, CHH+02].

The mutual information (see the definition in Section 2) can be computed
if the joint chances $\pi_{ij}$ of two random variables i and j are known.
The usual procedure in the common case of unknown chances $\pi_{ij}$ is to
use the empirical probabilities $\hat\pi_{ij}$ (i.e. the sample relative
frequencies $n_{ij}/n$) as if they were precisely known chances. This is
not always appropriate. Furthermore, the empirical mutual information
$I(\hat\pi)$ does not carry information about the reliability of the
estimate. In the Bayesian framework one can address these questions by
using a (second order) prior distribution $p(\pi)$, which takes account of
uncertainty about $\pi$. From the prior $p(\pi)$ and the likelihood one can
compute the posterior $p(\pi|n)$, from which the distribution $p(I|n)$ of
the mutual information can in principle be obtained.

This paper reports, in Section 2.1, the exact analytical mean of I and an
analytical $O(n^{-3})$-approximation of the variance. These are reliable
and quickly computable expressions following from $p(I|n)$ when a
Dirichlet prior is assumed over $\pi$. Such results allow one to obtain
analytical approximations of the distribution of I. We introduce
asymptotic approximations of the distribution in Section 2.2, graphically
showing that they are good also for small sample sizes.

The distribution of mutual information is then applied to feature
selection. Section 3.1 proposes two new filters that use credible
intervals to robustly estimate mutual information. The filters are
empirically tested, in turn, by coupling them with the naive Bayes
classifier to incrementally learn from and classify new data. On the ten
real data sets that we used, one of the two proposed filters outperforms
the traditional filter: it almost always selects fewer attributes than the
traditional one while always leading to equal or significantly better
prediction accuracy of the classifier (Section 4). The new filter is of
the same order of computational complexity as the filter based on
empirical mutual information, so that it appears to be a significant
improvement for real applications.

The proved importance of the distribution of mutual information led us to
extend the mentioned analytical work towards even more effective and
applicable methods. Section 5.1 proposes improved analytical
approximations for the tails of the distribution, which are often a
critical point for asymptotic approximations. Section 5.2 allows the
distribution of mutual information to be computed also from incomplete
samples. Closed-form formulas are developed for the case of feature
selection.

2 DISTRIBUTION OF MUTUAL INFORMATION

Consider two discrete random variables i and j taking values in
$\{1,\ldots,r\}$ and $\{1,\ldots,s\}$, respectively, and an i.i.d. random
process with samples $(i,j) \in \{1,\ldots,r\} \times \{1,\ldots,s\}$
drawn with joint chances $\pi_{ij}$. An important measure of the
stochastic dependence of i and j is the mutual information:

$$ I(\pi) = \sum_{i=1}^{r} \sum_{j=1}^{s} \pi_{ij} \log \frac{\pi_{ij}}{\pi_{i+}\pi_{+j}} \qquad (1) $$

where log denotes the natural logarithm and $\pi_{i+} = \sum_j \pi_{ij}$
and $\pi_{+j} = \sum_i \pi_{ij}$ are marginal chances. Often the chances
$\pi_{ij}$ are unknown and only a sample is available with $n_{ij}$
outcomes of pair (i,j). The empirical probability
$\hat\pi_{ij} = n_{ij}/n$ may be used as a point estimate of $\pi_{ij}$,
where $n = \sum_{ij} n_{ij}$ is the total sample size. This leads to an
empirical estimate

$$ I(\hat\pi) = \sum_{ij} \frac{n_{ij}}{n} \log \frac{n_{ij}\, n}{n_{i+}\, n_{+j}} $$

for the mutual information.
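The empirical estimate can be sketched in a few lines of Python. This is only an illustration: the function name and the nested-list layout of the contingency table are assumptions, and cells with zero counts are taken to contribute zero to the sum.

```python
import math

def empirical_mutual_information(counts):
    """Plug-in estimate I(pi_hat) from an r x s table of counts n_ij.

    `counts` is a list of r rows with s non-negative entries each
    (a hypothetical layout chosen for this sketch). Natural logarithms
    are used, as in Eq. (1).
    """
    n = sum(sum(row) for row in counts)            # total sample size
    row_tot = [sum(row) for row in counts]         # n_{i+}
    col_tot = [sum(col) for col in zip(*counts)]   # n_{+j}
    total = 0.0
    for i, row in enumerate(counts):
        for j, n_ij in enumerate(row):
            if n_ij > 0:  # empty cells contribute 0 by convention
                total += (n_ij / n) * math.log(n_ij * n / (row_tot[i] * col_tot[j]))
    return total
```

For a table whose cells factorize exactly into the marginals the estimate is zero, and for a diagonal 2 x 2 table it reaches the upper bound log 2.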

Unfortunately, the point estimation $I(\hat\pi)$ carries no information
about its accuracy. In the Bayesian approach to this problem one assumes a
prior (second order) probability density $p(\pi)$ for the unknown chances
$\pi_{ij}$ on the probability simplex. From this one can compute the
posterior distribution $p(\pi|n) \propto p(\pi) \prod_{ij} \pi_{ij}^{n_{ij}}$
(the $n_{ij}$ are multinomially distributed) and define the posterior
probability density of the mutual information (see footnotes 1 and 2):

$$ p(I|n) = \int \delta(I(\pi) - I)\, p(\pi|n)\, d^{rs}\pi. \qquad (2) $$

The $\delta(\cdot)$ distribution restricts the integral to $\pi$ for which
$I(\pi) = I$. For large sample size $n \to \infty$, $p(\pi|n)$ is strongly
peaked around $\pi = \hat\pi$ and $p(I|n)$ gets strongly peaked around the
frequency estimate $I = I(\hat\pi)$.

Footnote 1: $I(\pi)$ denotes the mutual information for the specific
chances $\pi$, whereas I in the context above is just some non-negative
real number. I will also denote the mutual information random variable in
the expectation E[I] and variance Var[I]. Expectations are always w.r.t.
the posterior distribution $p(\pi|n)$.

Footnote 2: Since $0 \le I(\pi) \le I_{max}$ with sharp upper bound
$I_{max} = \min\{\log r, \log s\}$, the integral may be restricted to
$\int_0^{I_{max}}$, which shows that the domain of $p(I|n)$ is
$[0, I_{max}]$.

2.1 Results for I under Dirichlet P(oste)riors

Many non-informative priors lead to a Dirichlet posterior distribution
$p(\pi|n) \propto \prod_{ij} \pi_{ij}^{n_{ij}-1}$ with interpretation
$n_{ij} = n'_{ij} + n''_{ij}$, where $n'_{ij}$ are the number of samples
(i,j), and $n''_{ij}$ comprises prior information (1 for the uniform
prior, 1/2 for Jeffreys' prior, 0 for Haldane's prior, 1/(rs) for Perks'
prior [GCSR95]). In principle this allows the posterior density $p(I|n)$
of the mutual information to be computed.
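As an illustration of how the posterior density of I can be explored numerically, the following sketch draws chances from the Dirichlet posterior and evaluates the mutual information of each draw. It is a minimal Monte Carlo example under stated assumptions, not the procedure used for the results reported here: the function names, the number of draws and the seed are hypothetical, and the Dirichlet draw is obtained by normalizing independent Gamma variates. The prior count must be positive for every cell (so Haldane's prior with empty cells is not covered).

```python
import math
import random

def mutual_information(pi):
    """I(pi) of Eq. (1) for an r x s table of joint chances (natural log)."""
    row = [sum(r_) for r_ in pi]
    col = [sum(c_) for c_ in zip(*pi)]
    total = 0.0
    for i, r_ in enumerate(pi):
        for j, p in enumerate(r_):
            if p > 0:
                total += p * math.log(p / (row[i] * col[j]))
    return total

def sample_posterior_mi(sample_counts, prior=1.0, draws=2000, seed=0):
    """Draw values of I from p(I|n) under a Dirichlet posterior.

    `sample_counts` holds the observed counts n'_ij; `prior` is the prior
    count n''_ij added to every cell (1.0 = uniform prior).
    """
    rng = random.Random(seed)
    r, s = len(sample_counts), len(sample_counts[0])
    alphas = [c + prior for row_ in sample_counts for c in row_]
    values = []
    for _ in range(draws):
        # A Dirichlet draw is a vector of Gamma(alpha, 1) variates, normalized.
        g = [rng.gammavariate(a, 1.0) for a in alphas]
        tot = sum(g)
        pi = [[g[i * s + j] / tot for j in range(s)] for i in range(r)]
        values.append(mutual_information(pi))
    return values
```

A histogram of the returned values approximates $p(I|n)$; their sample mean and variance can be compared with the analytical expressions of this section.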

We focus on the mean
$E[I] = \int_0^\infty I\, p(I|n)\, dI = \int I(\pi)\, p(\pi|n)\, d^{rs}\pi$
and the variance $Var[I] = E[(I - E[I])^2]$. Eq. (3) reports the exact
mean of the mutual information:

$$ E[I] = \frac{1}{n} \sum_{ij} n_{ij} \left[ \psi(n_{ij}+1) - \psi(n_{i+}+1) - \psi(n_{+j}+1) + \psi(n+1) \right], \qquad (3) $$

where $\psi$ is the $\psi$-function that for integer arguments is
$\psi(n+1) = -\gamma + \sum_{k=1}^{n} \frac{1}{k} = \log n + O(\frac{1}{n})$,
and $\gamma$ is Euler's constant. The approximate variance is given below:



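$$ Var[I] = \frac{K - J^2}{n+1} + \frac{M + (r-1)(s-1)(\frac{1}{2} - J) - Q}{(n+1)(n+2)} + O(n^{-3}), \qquad (4) $$

where

$$ K = \sum_{ij} \frac{n_{ij}}{n} \left( \log \frac{n_{ij}\, n}{n_{i+}\, n_{+j}} \right)^2, \qquad
   J = \sum_{ij} \frac{n_{ij}}{n} \log \frac{n_{ij}\, n}{n_{i+}\, n_{+j}}, $$

$$ M = \sum_{ij} \left( \frac{1}{n_{ij}} - \frac{1}{n_{i+}} - \frac{1}{n_{+j}} + \frac{1}{n} \right) n_{ij} \log \frac{n_{ij}\, n}{n_{i+}\, n_{+j}}, \qquad
   Q = 1 - \sum_{ij} \frac{n_{ij}^2}{n_{i+}\, n_{+j}}. $$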

The results are derived in [Hut01]. The result for the mean was also
reported in [WW95], Theorem 10. We are not aware of similar analytical
approximations for the variance. [WW95] express the exact variance as an
infinite sum, but this does not allow a straightforward systematic
approximation to be obtained. [Kle99] used heuristic numerical methods to
estimate the mean and the variance. However, the heuristic estimates are
incorrect, as follows from the comparison with the analytical results
provided here (see [Hut01]).

Let us consider two further points. First, the complexity to compute the
above expressions is of the same order $O(rs)$ as for the empirical mutual
information (1). All quantities needed to compute the mean and the
variance involve double sums only, and the function $\psi$ can be
pre-tabled.
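The double-sum structure of the mean (3) and of the approximate variance (4) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the prior count is assumed to be a positive integer (e.g. 1 for the uniform prior) so that $\psi$ can be evaluated from harmonic numbers, and $\psi$ is recomputed on each call rather than pre-tabled.

```python
import math

EULER_GAMMA = 0.5772156649015329

def psi_int(m):
    """psi(m) for a positive integer m: -gamma + sum_{k=1}^{m-1} 1/k."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, m))

def posterior_mean_and_variance(sample_counts, prior=1):
    """Exact mean (3) and O(n^-3) approximate variance (4) of I.

    `sample_counts` holds the observed counts n'_ij and `prior` the
    integer prior count n''_ij, so that n_ij = n'_ij + n''_ij >= 1.
    """
    n_tab = [[c + prior for c in row] for row in sample_counts]
    r, s = len(n_tab), len(n_tab[0])
    n = sum(sum(row) for row in n_tab)
    n_i = [sum(row) for row in n_tab]          # n_{i+}
    n_j = [sum(col) for col in zip(*n_tab)]    # n_{+j}
    mean = K = J = M = 0.0
    Q = 1.0
    for i in range(r):
        for j in range(s):
            nij = n_tab[i][j]
            mean += nij * (psi_int(nij + 1) - psi_int(n_i[i] + 1)
                           - psi_int(n_j[j] + 1) + psi_int(n + 1))
            lg = math.log(nij * n / (n_i[i] * n_j[j]))
            J += nij / n * lg
            K += nij / n * lg * lg
            M += (1 / nij - 1 / n_i[i] - 1 / n_j[j] + 1 / n) * nij * lg
            Q -= nij * nij / (n_i[i] * n_j[j])
    mean /= n
    var = (K - J * J) / (n + 1) \
        + (M + (r - 1) * (s - 1) * (0.5 - J) - Q) / ((n + 1) * (n + 2))
    return mean, var
```

In a real application the values of `psi_int` would be tabulated once up to the sample size, which keeps the cost at the $O(rs)$ stated above.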

Secondly, let us briefly consider the quality of the approximation of the
variance. The expression for the exact variance has been Taylor-expanded
in $(\frac{1}{n})$ to produce (4), so the relative error

$$ \frac{Var[I]_{approx} - Var[I]_{exact}}{Var[I]_{exact}} $$

of the approximation is of the order $n^{-2}$, if i and j are dependent.
In the opposite case, the $O(n^{-1})$ term in the sum drops itself down to
order $n^{-2}$, resulting in a reduced relative accuracy $O(\frac{1}{n})$
of (4). These results were confirmed by numerical experiments that we
realized by Monte Carlo simulation to obtain "exact" values of the
variance for representative choices of $\pi_{ij}$, r, s, and n.



2.2 Approximating the Distribution 

Let us now consider approximating the overall distribution of mutual
information based on the formulas for the mean and the variance given in
Section 2.1. Fitting a normal distribution is an obvious possible choice,
as the central limit theorem ensures that $p(I|n)$ converges to a Gaussian
distribution with mean E[I] and variance Var[I]. Since I is non-negative,
it is also worth considering the approximation of $p(I|n)$ by a Gamma
(i.e., a scaled $\chi^2$). Even better, as I can be normalized in order to
be upper bounded by 1, the Beta distribution seems to be another natural
candidate, being defined for variables in the [0, 1] real interval. Of
course the Gamma and the Beta are asymptotically correct, too.

We report a graphical comparison of the different approximations by
focusing on the special case of binary random variables, and on three
possible vectors of counts. Figure 1 compares the exact distribution of
mutual information, computed via Monte Carlo simulation, with the
approximating curves. The figure clearly shows that all the approximations
are rather good, with a slight preference for the Beta approximation. The
curves tend to do worse for smaller sample sizes, as expected. Higher
moments computed in [Hut01] may be used to improve the accuracy. A method
to specifically improve the tail approximation is given in Section 5.1.
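One way to fit the Beta approximation is to match its first two moments to the mean and variance of the normalized variable $I / I_{max}$. The sketch below assumes this moment-matching construction, which the text does not spell out; the function names are hypothetical.

```python
import math

def beta_parameters(mean, var, i_max):
    """Beta(a, b) parameters matching the mean and variance of I / i_max.

    `mean` and `var` are E[I] and Var[I]; `i_max` is min(log r, log s).
    Moment matching is an assumption of this sketch.
    """
    m = mean / i_max
    v = var / (i_max * i_max)
    common = m * (1.0 - m) / v - 1.0
    if common <= 0:
        raise ValueError("variance too large for a Beta with this mean")
    return m * common, (1.0 - m) * common

def beta_density(x, a, b):
    """Density of Beta(a, b) at x in the open interval (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))
```

The density of I itself follows by the change of variable, i.e. `beta_density(I / i_max, a, b) / i_max`.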




Figure 1: Distribution of mutual information for two binary random
variables. (The labeling of the horizontal axis, from 0 to
$I_{max} = \log(\min(r,s))$, is the percentage of $I_{max}$.) There are
three groups of curves, for different choices of counts
$(n_{11}, n_{12}, n_{21}, n_{22})$. The upper group is related to the
vector (40,10,20,80), the intermediate one to the vector (20,5,10,40), and
the lower group to (8,2,4,16). Each group shows the "exact" distribution
and three approximating curves, based on the Gaussian, Gamma and Beta
distributions.



3 FEATURE SELECTION 



Classification is one of the most important techniques for knowledge
discovery in databases [DHS01]. A classifier is an algorithm that
allocates new objects to one out of a finite set of previously defined
groups (or classes) on the basis of observations on several
characteristics of the objects, called attributes or features. Classifiers
can be learnt from data alone, making explicit the knowledge that is
hidden in raw data, and using this knowledge to make predictions about new
data.

Feature selection is a basic step in the process of building classifiers
[BL97, DL97, LM98]. In fact, even if theoretically more features should
provide one with better prediction accuracy (i.e., the relative number of
correct predictions), in real cases it has been observed many times that
this is not the case [KS96]. This depends on the limited availability of
data in real problems: successful models seem to strike a good balance
between model complexity and available information. In fact, feature
selection tends to produce models that are simpler, clearer,
computationally less expensive and, moreover, often more accurate in
prediction. Two major approaches to feature selection are commonly used
[JKP94]: filter and wrapper models. The filter approach is a preprocessing
step of the classification task. The wrapper model is computationally
heavier, as it implements a search in the feature space.

3.1 The Proposed Filters 

From now on we focus our attention on the filter approach. We consider the
well-known filter (F) that computes the empirical mutual information
between features and the class, and discards low-valued features [Lew92].
This is an easy and effective approach that has gained popularity with
time. Cheng reports that it is particularly well suited to jointly work
with Bayesian network classifiers, an approach by which he won the 2001
international knowledge discovery competition [CHH+02]. The "Weka" data
mining package implements it as a standard system tool (see [WF99],
p. 294).

A problem with this filter is the variability of the empirical mutual
information with the sample. This may allow wrong judgments of relevance
to be made, as when features are selected by keeping those for which
mutual information exceeds a fixed threshold $\varepsilon$. In order for
the selection to be robust, we must have some guarantee about the actual
value of mutual information.

We define two new filters. The backward filter (BF) discards an attribute
if its value of mutual information with the class is less than or equal to
$\varepsilon$ with given (high) probability p. The forward filter (FF)
includes an attribute if the mutual information is greater than
$\varepsilon$ with given (high) probability p. BF is a conservative
filter, because it will only discard features after observing substantial
evidence supporting their irrelevance. FF instead will tend to use fewer
features, i.e. only those for which there is substantial evidence about
them being useful in predicting the class.
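The two decision rules can be sketched as follows, given the posterior mean and variance of the mutual information between an attribute and the class. To stay within the Python standard library, this sketch evaluates the tail probability with the Gaussian approximation of Section 2.2 rather than the Beta approximation; the function names, the default values and the use of non-strict inequalities at the probability level p are assumptions made for illustration.

```python
import math

def prob_mi_exceeds(mean, var, eps):
    """P(I > eps | n) under a Gaussian approximation N(mean, var)."""
    z = (eps - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def forward_filter_keeps(mean, var, eps=0.003, p=0.95):
    """FF: include the attribute only if P(I > eps | n) >= p."""
    return prob_mi_exceeds(mean, var, eps) >= p

def backward_filter_keeps(mean, var, eps=0.003, p=0.95):
    """BF: discard the attribute only if P(I <= eps | n) >= p."""
    return (1.0 - prob_mi_exceeds(mean, var, eps)) < p
```

An attribute whose posterior mass straddles the threshold is excluded by FF but retained by BF, which is the asymmetry described above.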

The next sections present experimental comparisons 
of the new filters and the original filter F. 



4 EXPERIMENTAL ANALYSES

For the following experiments we use the naive Bayes classifier [DH73].
This is a good classification model (despite its simplifying assumptions,
see [DP97]), which often competes successfully with the state-of-the-art
classifiers from the machine learning field, such as C4.5 [Qui93]. The
experiments focus on the incremental use of the naive Bayes classifier, a
natural learning process when the data are available sequentially: the
data set is read instance by instance; each time, the chosen filter
selects a subset of attributes that the naive Bayes uses to classify the
new instance; the naive Bayes then updates its knowledge by taking into
consideration the new instance and its actual class. The incremental
approach allows us to better highlight the different behaviors of the
empirical filter (F) and those based on credible intervals on mutual
information (BF and FF). In fact, for increasing sizes of the learning set
the filters converge to the same behavior.
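The incremental protocol (select attributes, classify the new instance, then learn from it) can be sketched as below. This is an illustrative skeleton, not the implementation used in the experiments: the class and function names are hypothetical, features are assumed nominal, the classifier smooths its counts with a uniform prior count of 1, and `select_features` stands for any of the filters, represented here as a callable returning the indices of the selected attributes.

```python
import math
from collections import defaultdict

class IncrementalNaiveBayes:
    """Naive Bayes over nominal features with prior counts added to each cell."""

    def __init__(self, n_features, prior=1.0):
        self.prior = prior
        self.class_counts = defaultdict(int)                        # class -> count
        self.joint = [defaultdict(int) for _ in range(n_features)]  # (value, class) -> count
        self.values = [set() for _ in range(n_features)]            # values seen per feature

    def predict(self, x, selected):
        if not self.class_counts:
            return None  # nothing learnt yet
        total = sum(self.class_counts.values())
        n_classes = len(self.class_counts)
        best_class, best_score = None, -math.inf
        for c, n_c in self.class_counts.items():
            score = math.log((n_c + self.prior) / (total + self.prior * n_classes))
            for f in selected:
                n_values = len(self.values[f] | {x[f]})
                n_vc = self.joint[f].get((x[f], c), 0)
                score += math.log((n_vc + self.prior) / (n_c + self.prior * n_values))
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    def update(self, x, c):
        self.class_counts[c] += 1
        for f, v in enumerate(x):
            self.joint[f][(v, c)] += 1
            self.values[f].add(v)

def run_incrementally(data, select_features):
    """Return the running prediction accuracy after each (x, class) pair."""
    data = list(data)
    if not data:
        return []
    model = IncrementalNaiveBayes(n_features=len(data[0][0]))
    correct, accuracies = 0, []
    for t, (x, c) in enumerate(data, start=1):
        selected = select_features(model)       # the filter chooses the attributes
        if model.predict(x, selected) == c:     # classify before seeing the class
            correct += 1
        model.update(x, c)                      # then learn from the instance
        accuracies.append(correct / t)
    return accuracies
```

A filter would compute, inside `select_features`, the mutual information statistics from the counts stored in the model; the example callable `lambda model: [0]` simply keeps the first attribute.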

For each filter, we are interested in experimentally 
evaluating two quantities: for each instance of the data 
set, the average number of correct predictions (namely, 
the prediction accuracy) of the naive Bayes classifier up 
to such instance; and the average number of attributes 
used. By these quantities we can compare the filters 
and judge their effectiveness. 

The implementation details for the following experiments include: using
the Beta approximation (Section 2.2) to the distribution of mutual
information; using the uniform prior for the naive Bayes classifier and
all the filters; using natural logarithms everywhere; and setting the
level p of the posterior probability to 0.95. As far as $\varepsilon$ is
concerned, we cannot set it to zero because the probability that two
variables are independent (I = 0) is zero according to the inferential
Bayesian approach. We can interpret the parameter $\varepsilon$ as a
degree of dependency strength below which attributes are deemed
irrelevant. We set $\varepsilon$ to 0.003, in the attempt to discard only
attributes with negligible impact on predictions. As we will see, such a
low threshold can nevertheless lead to discarding many attributes.

4.1 Data Sets 

Table 1 lists the 10 data sets used in the experiments.
These are real data sets on a number of different do- 
mains. For example, Shuttle-small reports data on di- 
agnosing failures of the space shuttle; Lymphography 
and Hypothyroid are medical data sets; Spam is a body 
of e-mails that can be spam or non-spam; etc. 

The data sets presenting non-nominal features have been pre-discretized by
MLC++ [KJL+94] with default options. This step may remove some attributes,
judging them as irrelevant, so the number of features in the table refers
to the data sets after the possible discretization. The instances with
missing values have been discarded, and the third column in the table
refers to the data sets without missing values. Finally, the instances
have been randomly sorted before starting the experiments.

Table 1: Data sets used in the experiments, together with their number of
features, of instances and the relative frequency of the majority class.
All but the Spam data sets are available from the UCI repository of
machine learning data sets [MA95]. The Spam data set is described in
[AKC+00] and available from Androutsopoulos's web page.

Name            # feat.   # inst.   maj. class
Australian           36       690        0.555
Chess                36      3196        0.520
Crx                  15       653        0.547
German-org           17      1000        0.700
Hypothyroid          23      2238        0.942
Led24                24      3200        0.105
Lymphography         18       148        0.547
Shuttle-small         8      5800        0.787
Spam              21611      1101        0.563
Vote                 16       435        0.614

4.2 Results

In short, the results show that FF outperforms the commonly used filter F,
which, in turn, outperforms the filter BF. FF leads either to the same
prediction accuracy as F or to a better one, using substantially fewer
attributes most of the time. The same holds for F versus BF.

In particular, we used the two-tailed paired t test at level 0.05 to
compare the prediction accuracies of the naive Bayes with different
filters, in the first k instances of the data set, for each k.

On eight data sets out of ten, both the differences between FF and F, and
the differences between F and BF, were never statistically significant,
despite the often substantial difference in the number of attributes used
(see Table 2).

The remaining cases are described by means of the following figures.
Figure 2 shows that FF allowed the naive Bayes to make significantly
better predictions than F for the greatest part of the Chess data set. The
maximum difference in prediction accuracy is obtained at instance 422,
where the accuracies are 0.889 and 0.832 for the cases FF and F,
respectively. Figure 2 does not report the BF case, because there is no
significant difference with the F curve. The good performance of FF was
obtained using only about one third of the attributes (Table 2).

Figure 3 compares the accuracies on the Spam data set. The difference
between the cases FF and F is significant in the range of instances
32-413, with a maximum at instance 59 where accuracies are 0.797 and 0.559
for FF and F, respectively. BF is significantly worse than F from instance
65 to the end. This excellent performance of FF is even more valuable
considering the very low number of attributes selected for classification.
In the Spam case, attributes are binary and correspond to the presence or
absence of words in an e-mail, and the goal is to decide whether or not
the e-mail is spam. All the 21611 words found in the body of e-mails were
initially considered. FF shows that only an average of about 123 relevant
words is needed to make good predictions. Worse predictions are made using
F and BF, which select, on average, about 822 and 13127 words,
respectively. Figure 4 shows the average number of excluded features for
the three filters on the Spam data set. FF quickly discards most of the
features, and keeps the number of selected features almost constant over
the whole process. The other filters approach this number, at different
speeds, after initially including many more features than FF.

Table 2: Average number of attributes selected by the filters on the
entire data set, reported in the last three columns. The second column
from left reports the original number of features. In all but one case, FF
selected fewer features than F, sometimes much fewer; F usually selected
much fewer features than BF, which was very conservative. Names marked
with * refer to data sets on which prediction accuracies were
significantly different.

Data set        # feat.       FF        F        BF
Australian           36     32.6     34.3      35.9
Chess *              36     12.6     18.1      26.1
Crx                  15     11.9     13.2      15.0
German-org           17      5.1      8.8      15.2
Hypothyroid          23      4.8      8.4      17.1
Led24                24     13.6     14.0      24.0
Lymphography         18     18.0     18.0      18.0
Shuttle-small         8      7.1      7.7       8.0
Spam *            21611    123.1    822.0   13127.4
Vote                 16     14.0     15.2      16.0

In summary, the experimental evidence supports the 
strategy of only using the features that are reliably 
judged as carrying useful information to predict the 
class, provided that the judgment can be updated as 
soon as new observations are collected. FF almost al- 
ways selects fewer features than F, leading to a predic- 
tion accuracy at least as good as the one F leads to. 
The comparison between F and BF is analogous, so FF 
appears to be the best filter and BF the worst. However, the conservative
nature of BF might turn out to be successful when data are available in
groups, so that sequential updating is not viable. In this case, it does
not seem safe to take strong decisions of exclusion that have to be
maintained for a number of new instances, unless there is substantial
evidence against the relevance of an attribute.

Figure 2: Comparison of the prediction accuracies of the naive Bayes with
filters F and FF on the Chess data set (horizontal axis: instance number).
The gray area denotes differences that are not statistically significant.



5 EXTENSIONS 



5.1 Tails Approximation 

The expansion of $p(I|n)$ around the mean can be a poor estimate for
extreme values $I \approx 0$ or $I \approx I_{max}$ and it is better to
use tail approximations. The scaling behavior of $p(I|n)$ can be
determined in the following way: $I(\pi)$ is small iff $\pi_{ij}$
describes near independent random variables i and j. This suggests the
reparameterization $\pi_{ij} = \pi_{i+}\pi_{+j} + \Delta_{ij}$ in the
integral (2). Only small $\Delta$ can lead to small $I(\pi)$. Hence, for
small I we may expand $I(\pi)$ in $\Delta$ in expression (2). Correctly
taking into account the constraints on $\Delta$, a scaling argument shows
that $p(I|n) \sim I^{\frac{1}{2}(r-1)(s-1)-1}$. Similarly we get the
scaling behavior of $p(I|n)$ around
$I \approx I_{max} = \min\{\log r, \log s\}$. $I(\pi)$ can be written as
$H(i) - H(i|j)$, where H is the entropy. Without loss of generality
$r \le s$. If the prior $p(\pi|n)$ converges to zero for
$\pi_{ij} \to 0$ sufficiently rapidly (which is the case for the Dirichlet
for not too small n), then H(i) gives the dominant contribution when
$I \to I_{max}$. The scaling behavior turns out to be
$p(I_{max} - I_c|n) \sim I_c^{\frac{r-3}{2}}$. These expressions,
including the proportionality constants in case of the Dirichlet
distribution, are derived in the journal version [HZ02].



Figure 3: Prediction accuracies of the naive Bayes with 
filters F, FF and BF on the Spam data set. The dif- 
ferences between F and FF are significant in the range 
of observations 32-413. The differences between F and 
BF are significant from observations 65 to the end (this 
significance is not displayed in the picture). 



5.2 Incomplete Samples 

In the following we generalize the setup to include the case of missing
data, which often occurs in practice. For instance, observed instances
often consist of several features plus class label, but some features may
not be observed, i.e. if i is a feature and j a class label, from the pair
(i,j) only j is observed. We extend the contingency table $n_{ij}$ to
include $n_{?j}$, which counts the number of instances in which only the
class j is observed (= number of (?,j) instances). It has been shown that
using such partially observed instances can improve classification
accuracy [LR87]. We make the common assumption that the missing-data
mechanism is ignorable (missing at random and distinct) [LR87], i.e. the
probability distribution of class labels j of instances with missing
feature i is assumed to coincide with the marginal $\pi_{+j}$.

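The probability of a specific data set D of size $N = n + n_{+?}$ with
contingency table $\mathbf{N} = \{n_{ij}, n_{i?}\}$ given $\pi$, hence, is
$p(D|\pi, n, n_{+?}) = \prod_{ij} \pi_{ij}^{n_{ij}} \prod_i \pi_{i+}^{n_{i?}}$.
Assuming a uniform prior $p(\pi) \sim \delta(\pi_{++} - 1)$, Bayes' rule
leads to the posterior
$p(\pi|\mathbf{N}) \sim \prod_{ij} \pi_{ij}^{n_{ij}} \prod_i \pi_{i+}^{n_{i?}} \delta(\pi_{++} - 1)$.
The mean and variance of I in leading order in $N^{-1}$ can be shown to be

$$ E[I] = I(\hat\pi) + O(N^{-1}), $$

$$ Var[I] = \frac{1}{N} \left[ \tilde K - \tilde J^2 / \tilde Q - \tilde P \right] + O(N^{-2}), $$

where

$$ \hat\pi_{ij} = \frac{n_{i+} + n_{i?}}{N} \frac{n_{ij}}{n_{i+}}, \qquad
   \rho_{ij} = N \frac{\hat\pi_{ij}^2}{n_{ij}}, \qquad
   \rho_{i?} = N \frac{\hat\pi_{i+}^2}{n_{i?}}, $$

$$ \tilde Q_{i?} = \frac{\rho_{i?}}{\rho_{i?} + \rho_{i+}}, \qquad
   \tilde Q = \sum_i \rho_{i+} \tilde Q_{i?}, \qquad
   \tilde K = \sum_{ij} \rho_{ij} \left( \log \frac{\hat\pi_{ij}}{\hat\pi_{i+} \hat\pi_{+j}} \right)^2, $$

$$ \tilde P = \sum_i \frac{\tilde J_{i+}^2 \tilde Q_{i?}}{\rho_{i?}}, \qquad
   \tilde J = \sum_i \tilde J_{i+} \tilde Q_{i?}, \qquad
   \tilde J_{i+} = \sum_j \rho_{ij} \log \frac{\hat\pi_{ij}}{\hat\pi_{i+} \hat\pi_{+j}}. $$

The derivation will be given in the journal version [HZ02]. Note that for
the complete case $n_{i?} = 0$, we have
$\hat\pi_{ij} = \rho_{ij} = n_{ij}/n$, $\rho_{i?} = \infty$,
$\tilde Q_{i?} = 1$, $\tilde J = J$, $\tilde K = K$, and $\tilde P = 0$,
consistently with (4). Preliminary experiments confirm that FF outperforms
F also when feature values are partially missing.

Figure 4: Comparison of the prediction accuracies of the naive Bayes with
filters F and FF on the Chess data set (horizontal axis: instance number).
The gray area denotes differences that are not statistically significant.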

All expressions involve at most a double sum, hence the overall
computation time is $O(rs)$. For the case of missing class labels, but no
missing features, symmetrical formulas exist. In the general case of
missing features and missing class labels, estimates for $\pi$ have to be
obtained numerically, e.g. by the EM algorithm [CF74] in time
$O(\# \cdot rs)$, where # is the number of iterations of EM. In [HZ02] we
derive a closed form expression for the covariance of $p(\pi|\mathbf{N})$
and the variance of I to leading order which can be evaluated in time
$O(s^2(s + r))$. This is reasonably fast if the number of classes is
small, as is often the case in practice. Note that these expressions
converge for $N \to \infty$ to the exact values. The missingness need not
be small.

6 CONCLUSIONS 

This paper presented ongoing research on the distri- 
bution of mutual information and its application to 
the important issue of feature selection. In the for- 
mer case, we provide fast analytical formulations that 
are shown to approximate the distribution well also for 
small sample sizes. Extensions are presented that, on 
one side, allow improved approximations of the tails of 
the distribution to be obtained, and on the other, al- 
low the distribution to be efficiently approximated also 
in the common case of incomplete samples. As far as 
feature selection is concerned, we empirically showed 
that a newly defined filter based on the distribution 
of mutual information outperforms the popular filter 
based on empirical mutual information. This result is 
obtained jointly with the naive Bayes classifier. 

More broadly speaking, the presented results are important since reliable
estimates of mutual information can significantly improve the quality of
applications, as for the case of feature selection reported here. The
significance of the results is also reinforced by the many important
models based on mutual information. Our results could be applied, for
instance, to robustly infer classification trees. Bayesian networks can be
inferred by using credible intervals for mutual information, as proposed
by [Kle99]. The well-known Chow and Liu's approach [CL68] to the inference
of tree-networks might be extended to credible intervals (this could be
done by joining results presented here and in past work [Zaf01]).

Overall, the distribution of mutual information 
seems to be a basis on which reliable and effective un- 
certain models can be developed. 